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Abstract. Existing grammar frameworks do not work out particularly 
well for controlled natural languages (CNL), especially if they are to be 
used in predictive editors. I introduce in this paper a new grammar nota- 
tion, called Codeco, which is designed specifically for CNLs and predictive 
editors. Two different parsers have been implemented and a large subset 
of Attempto Controlled English (ACE) has been represented in Codeco. 
The results show that Codeco is practical, adequate and efficient. 



1 Introduction 

Controlled natural languages (CNL) like Attempto Controlled English (ACE) [3] 
and others have been proposed to make knowledge representation systems more 
user- friendly |10I5| . It can be very difficult, however, for users to write state- 
ments that comply with the restrictions of CNLs. Three approaches have been 
proposed so far to solve this problem: error messages [5], language generation 
[S] , and predictive editors jll|5[ . Each of the approaches has its own advantages 
and drawbacks. In my view, the predictive editor approach — where the editor 
shows all possible continuations of a partial sentence — is the most elegant one. 
However, implementing lookahead features (i.e. retrieving the possible contin- 
uations for a given partial sentence on the basis of a given grammar) is not a 
trivial task, especially if the grammar describes complex nonlocal structures like 
anaphoric references. Furthermore, good tool integration is crucial for CNLs, 
no matter which approach is followed. For this reason, it is desirable to have a 
simple grammar notation that is fully declarative and can easily be implemented 
in different kinds of programming languages. 

These requirements are not satisfied by existing grammar frameworks. Natu- 
ral language grammar frameworks like Head-driven Phrase Structure Grammars 
(HPSG) are unnecessarily complex for CNLs and do not allow for the imple- 
mentation of efficient and reliable lookahead features. Grammar frameworks for 
formal languages — called parser generators — like Yacc |3] do not allow for 
the declarative definition of complex (i.e. context-sensitive) structures. Definite 
clause grammars (DCG) [7] are a good candidate but they have the problem 
that they are hard to implement in programming languages that are not logic- 
based. Furthermore, all these grammar frameworks have no special support for 



the deterministic resolution of anaphoric references, as required by languages like 
ACE. The Grammatical Framework (GF) [T] is a recent grammar framework in 
competition with the approach to be presented here. It allows for declarative 
definition of grammar rules and can be used in predictive editors, but it lacks 
support for anaphoric references. 

The following example illustrates the difficulty of lookahead in the case of 
complex nonlocal structures like anaphoric references. The partial ACE sentence 
"every man protects a house from every enemy and does not destroy ..." can be 
continued by several anaphoric structures referring to earlier objects in the sen- 
tence: "the man" or "himself" can be used to refer to "man"; "the house" or "it" 
can be used to refer to "house" . It is not possible, however, to refer to "enemy" , 
because it is under the scope of the quantifier "every" and such references from 
outside the scope of a quantified entity do not make sense logically.^ Thus, a pre- 
dictive editor should in this case propose "the man" and "the house" as possible 
continuations of the given partial sentence, but not "the enemy" . 

2 Codeco 

In order to solve the problems sketched above, I created a new grammar notation 
called Codeco, which stands for "concrete and declarative grammar notation for 
controlled natural languages" . This grammar notation is designed specifically 
for CNLs to be used in predictive editors. Below is a brief description of the 
most important features of Codeco. 

In a nutshell, grammar rules in Codeco consist of grammatical categories with 
flat feature structures^. One of the important features of Codeco is that ana- 
phoric references and their potential antecedents can be represented by special 
categories for backward references "<" and forward references ">" , respectively. 
These special categories establish nonlocal dependencies across the syntax tree in 
the sense that each backward reference needs a matching (i.e. unifiable) forward 
reference somewhere in the preceding text. 

Furthermore, scopes can be defined that make the contained forward refer- 
ences inaccessible from the outside. In Codeco, scopes are opened by the special 
category "//" and are closed by the use of scope-closing rules "^" . In contrast 
to normal rules "^'" , scope-closing rules close at their end all scopes that have 

^ The difference between "every man" and "every enemy" in the given example is not 
obvious. The scope that is opened by the quantifier "every" of "every enemy" closes 
after "enemy", because the verb phrase ends there. In ACE, verb phrases — but not 
only verb phrases — close at their end all scopes that are opened within (which is, 
in general, also true for full natural English). Since "every man" is not part of that 
verb phrase, the scope for "every man" is still open at the end of the partial sentence. 
This is why "man" is accessible for anaphoric references, but "enemy" is not. See 
also Fig. [T] 

^ The restriction to flat feature structures (instead of general ones) restricts the ex- 
pressive power of Codeco while making it easier to process and implement. The 
results suggest that Codeco is sufficiently expressive for common CNLs. 



been opened by one of their child categories. In this way, Codeco allows us to 
define the positions where scopes open and close in a simple and fully declar- 
ative way. Other extensions that are not discussed here are needed to handle 
pronouns, variables, and references to proper names in the desired way. See [5] 
for the details. 

The following grammar rules are examples that show how the special struc- 
tures of Codeco are used: 




The first rule represents existentially quantified noun phrases like "a house" and 
establishes a forward reference to define its status as a potential antecedent 
for subsequent anaphoric references. The second rule describes a definite noun 
phrase like "the house" and declares its anaphoric property by a backward refer- 
ence. The third rule describes the determiner "every" opening a scope through 
the use of the scope opener category. The last rule describes a verb phrase and 
is scope-closing, which means that the end of such a verb phrase also marks the 
end of all scopes that have been opened within. 

Codeco grammars allow for the correct and efficient implementation of looka- 
head features. Figure [1] shows a possible syntax tree of the example above and 
visualizes how the possible anaphoric references can be found. 

Two independent parsers have been implemented for Codeco: one that exe- 
cutes Codeco grammars as Prolog DCGs and another one that is a chart parser 
implemented in Java. These two implementations have been shown to be equiva- 
lent and efficient [6] . The chart parser for Codeco is used in the predictive editors 
of AceWiki [5] and the ACE Editor'^. The DCG parser is used for regression and 
ambiguity tests for the grammars of both tools. 

The Codeco grammar that is used in the ACE Editor consists of 164 grammar 
rules and describes a large subset of ACE covering nouns, proper names, intran- 
sitive and transitive verbs, adjectives, adverbs, prepositions, variables, plurals, 
negation, prepositional phrases, relative clauses, modality, numerical quantifiers, 
various kinds of coordination, conditional sentences, questions, anaphoric pro- 
nouns, and more. Currently unsupported ACE features are mass nouns, mea- 
surement nouns, ditransitive verbs, numbers and strings as noun phrases, sen- 
tences as verb phrase complements, Saxon genitive, possessive pronouns, noun 
phrase coordination, and commands. Nevertheless, this subset of ACE is — to 



|http: //attempt o . if i .uzh. ch/webapps/aceeditor/ 




Fig. 1. This figure shows an exemplary syntax tree of a partial ACE sentence. 
Head nodes of scope-closing rules are marked with The positions where 

scopes open and close are marked by parentheses. The shaded area marks a 
closed scope, which is not accessible from the outside. The dashed lines indicate 
the possible references. 



my knowledge — the broadest subset of English that has ever been defined in a 
formal and fully declarative way and that includes complex issues like anaphoric 
references. 

Codeco grammars can be checked for syntactic ambiguity by exhaustive lan- 
guage generation up to a certain sentence length. Due to combinatorial explosion, 
it is advisable not to check the complete language but to define a core subset 
thereof. Such a core subset should use a minimal lexicon and should have a re- 
duced number of grammar rules, in a way that language structures that could 
possibly lead to complex cases of ambiguity are retained. For the ACE Codeco 
grammar introduced above, such a core subset has been defined, consisting of 97 
of the altogether 164 grammar rules. Exhaustive language generation of this core 
subset up to ten tokens leads to 2'250'869 sentences, which are all different. Since 
syntactically ambiguous sentences would be generated more than once (because 
they have more than one syntax tree), this proves that at least the core of the 
ACE Codeco grammar is unambiguous, at least for sentences up to ten tokens. 
In the same way, it can be proven that the core of the ACE Codeco grammar is 
indeed a subset of ACE. 

See my thesis |6] for the complete description of Codeco, the full ACE Codeco 
grammar, complexity considerations, the details and results of the performed 
tests and the involved algorithms. 

3 Conclusions 

In summary, the Codeco notation allows us to define controlled subsets of natural 
languages in an adequate way. The resulting grammars are fully declarative, can 



be efficiently interpreted in different kinds of programming languages, and allow 
for lookahead features, which is needed for predictive editors. Furthermore, it 
allows for different kinds of automatic tests, which is very important for the 
development of reliable practical applications. 

Codeco is a proposal for a general CNL grammar notation. It cannot be 
excluded that extensions become necessary to express the syntax of other CNLs, 
but Codeco has been shown to work very well for a large subset of ACE, which 
is one of the most advanced CNLs to date. 
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